CRI RNAseq Pipelines

RUN AS PRACTICE

Pipeline package is available on CRI Bioinformatics FTP

The Center for Research Informatics (CRI) provides computational resources and expertise in biomedical informatics for researchers in the Biological Sciences Division (BSD) of the University of Chicago.

As a bioinformatics core, we are actively improving our pipelines and expanding pipeline functions. The tutorials will be updated in a timely manner but may not reflect the newest updates of the pipelines. Stay tuned with us for the latest pipeline release.

If you have any questions, comments, or suggestions, feel free to contact our core at bioinformatics@bsd.uchicago.edu or one of our bioinformaticians.

Quick Start | Top

  1. modify the generator script Build_RNAseq.CRI-BIO-646.sh accordingly
    1. project=“PROJECT_AS_PREFIX” (e.g., CRI-BIO-646 which is used as a prefix of metadata file CRI-BIO-646.metadata.txt and configuration file CRI-BIO-646.pipeline.yaml)
    2. padding=“DIRECTORY_NAME_CONTAINING_PROJECT_DATA” (e.g., CRI-BIO-646 which is the folder name to accommodate metadata file, configuration file, sequencing data folder, and references folder)
  2. prepare metadata file as the example CRI-BIO-646.metadata.txt
    1. Single End (SE) Library
      1. Set Flavor column as 1xReadLength (e.g., 1x50)
      2. Set Seqfile1 column as the file name of the repective sequencing file
    2. Paired End (PE) Library
      1. Set Flavor column as 2xReadLength (e.g., 2x50)
      2. Set Seqfile1 column as the file name of the repective read 1 (R1) sequencing file
      3. Set an additional column named ‘Seqfile2’ as the file name of the repective read 2 (R2) sequencing file
    3. Non strand-specific Library
      1. Set LibType column to NS
    4. Strand-specific Library
      1. Inquire the library type from your seuqencing center and set LibType column to FR (the left-most end of the fragment (in transcript coordinates, or the first-strand synthesis) is the first sequenced) or RF (the right-most end of the fragment (in transcript coordinates) is the first sequenced, or the second-strand synthesis). You can read this blog for more details of strand-specific RNA-seq.
  3. prepare reference files as the example reference hg38 under [/gpfs/data/bioinformatics/cri_rnaseq_2018_ex/example/references/v28_92_GRCh38.p12)
  4. prepare pre-built STAR index files as the example reference hg38 under [/gpfs/data/bioinformatics/cri_rnaseq_2018_ex/example/references/v28_92_GRCh38.p12/STAR)

Introduction | Top

RNA sequencing (RNA-seq) is a revolutionary approach that uses the next-generation sequencing technologies to detect and quantify expressed transcripts in biological samples. Compared to other methods such as microarrays, RNA-seq provides a more unbiased assessment of the full range of transcripts and their isoforms with a greater dynamic range in expression quantification.

In this tutorial, you will learn how to use the CRI’s RNA-seq pipeline (available on both CRI HPC cluster and GitHub)) to analyze Illumina RNA sequencing data. The tutorial comprises the following Steps:

By the end of this tutorial, you will:

This tutorial is based on CRI’s high-performance computing (HPC) cluster. If you are not familiar with this newly assembled cluster, a concise user’s guide can be found here.

Work Folow | Top

The RNA-seq data used in this tutorial are from CRI-BIO-646-BMB-RKeenan.

In this tutorial, we use the sequencing reads in the project CRI-BIO-646 in mouse as example. The sample information are saved in the file CRI-BIO-646.metadata.txt (see below).

Work Flow

Work Flow

Data Description | Top

There are six (partial) single-end RNA-seq sequencing libraries will be used as the example dataset In this tutorial. Their respective sample information is described in the metadata table CRI-BIO-646/CRI-BIO-646.metadata.txt.

Sample Description
Sample Library ReadGroup LibType Platform SequencingCenter Date Lane Unit Flavor Encoding Run Genome NucleicAcid Group Location Seqfile1
WT01 WT01 BK-AA-1_S11_L007_R355 RF Illumina GCF 2018-11-07 7 K00242 1x50 33 355 hg38 rnaseq WT CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ BK-AA-1_S11_L007_R1_001.fastq.gz
WT02 WT02 BK-AA-2_S12_L007_R355 RF Illumina GCF 2018-11-07 7 K00242 1x50 33 355 hg38 rnaseq WT CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ BK-AA-2_S12_L007_R1_001.fastq.gz
WT03 WT03 BK-AA-3_S13_L007_R355 RF Illumina GCF 2018-11-07 7 K00242 1x50 33 355 hg38 rnaseq WT CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ BK-AA-3_S13_L007_R1_001.fastq.gz
KO01 KO01 BK-AA-4_S14_L007_R355 RF Illumina GCF 2018-11-07 7 K00242 1x50 33 355 hg38 rnaseq KO CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ BK-AA-4_S14_L007_R1_001.fastq.gz
KO02 KO02 BK-AA-5_S15_L007_R355 RF Illumina GCF 2018-11-07 7 K00242 1x50 33 355 hg38 rnaseq KO CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ BK-AA-5_S15_L007_R1_001.fastq.gz
KO03 KO03 BK-AA-6_S16_L007_R355 RF Illumina GCF 2018-11-07 7 K00242 1x50 33 355 hg38 rnaseq KO CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ BK-AA-6_S16_L007_R1_001.fastq.gz
WT01 WT01 BK-AA-1_S11_L007_R357 RF Illumina GCF 2018-11-07 7 K00242 1x50 33 357 hg38 rnaseq WT CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ BK-AA-1_S11_L007_R1_001.fastq.gz
WT02 WT02 BK-AA-2_S12_L007_R357 RF Illumina GCF 2018-11-07 7 K00242 1x50 33 357 hg38 rnaseq WT CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ BK-AA-2_S12_L007_R1_001.fastq.gz
WT03 WT03 BK-AA-3_S13_L007_R357 RF Illumina GCF 2018-11-07 7 K00242 1x50 33 357 hg38 rnaseq WT CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ BK-AA-3_S13_L007_R1_001.fastq.gz
KO01 KO01 BK-AA-4_S14_L007_R357 RF Illumina GCF 2018-11-07 7 K00242 1x50 33 357 hg38 rnaseq KO CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ BK-AA-4_S14_L007_R1_001.fastq.gz
KO02 KO02 BK-AA-5_S15_L007_R357 RF Illumina GCF 2018-11-07 7 K00242 1x50 33 357 hg38 rnaseq KO CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ BK-AA-5_S15_L007_R1_001.fastq.gz
KO03 KO03 BK-AA-6_S16_L007_R357 RF Illumina GCF 2018-11-07 7 K00242 1x50 33 357 hg38 rnaseq KO CRI-BIO-646/data/180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10/FastQ BK-AA-6_S16_L007_R1_001.fastq.gz

Prerequisites | Top

We will use SSH (Secure Shell) to connect to CRI’s HPC. SSH now is included or can be installed in all standard operating systems (Windows, Linux, and OS X).

Login and Setup Tutorial Working Directory | Top

The login procedure varies slightly depending on whether you use a Mac/Unix/Linux computer or a Windows computer.

  • Log into one of entry nodes in CRI HPC
    1. Open a terminal session.
    2. Connect to the login node of the CRI HPC cluster:

      $ ssh -l username@gardner.cri.uchicago.edu
    3. If it’s your first time to log in, you will be asked to accept the ssh key. Type “yes
    4. Type in the password when prompted

      Make sure that you replace username with your login name.

  • CAUTION
    • THIS PACKAGE IS LARGE, PLEASE DO NOT DOWNLOAD IT TO YOUR HOME DIRECTORY
    • USE OTHER LOCATION LIKE /gpfs/data/bioinformatics/username
  • Set up a tutorial directory
    1. You should be in your home directory after logging in

      $ pwd
      /home/username
      $ cd /gpfs/data/bioinformatics/username; pwd
      /gpfs/data/bioinformatics/username
    2. Instead of downloading the pipeline package to your local home directory, use other location like /gpfs/data/bioinformatics/username

      $ cd /gpfs/data/bioinformatics/username; pwd
      /gpfs/data/bioinformatics/username
  • Download the pipeline package
    1. One way to download the pipeline package via git clone

      $ cp ftp://logia.cri.uchicago.edu/bioinformatics/tutorials/Nov2018/CRI-BIO-646-BMB-RKeenan.tgz .
    2. Uncompress the tarball file

      $ tar -zxvf CRI-BIO-646-BMB-RKeenan.tgz
    3. Change working directory to pipeline dirctory

      $ cd CRI-BIO-646-BMB-RKeenan
      $ tree -d
      |-- CRI-BIO-646
      |   |-- data
      |   |   |-- 180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10
      |   |   |   `-- FastQ
      |   |   `-- 180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10
      |   |       `-- FastQ
      |   `-- references
      |       `-- v28_92_GRCh38.p12
      |           `-- STAR
      |-- SRC
      |   |-- Python
      |   |   |-- lib
      |   |   |-- module
      |   |   `-- util
      |   `-- R
      |       |-- module
      |       `-- util
      `-- docs
          `-- IMG
  • File structure
    • Raw sequencing data files (*.fastq.gz) are located at CRI-BIO-646/data/

      $ tree CRI-BIO-646/data/
      |-- 180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10
      |   `-- FastQ
      |       |-- BK-AA-1_S11_L007_R1_001.fastq.gz
      |       |-- BK-AA-2_S12_L007_R1_001.fastq.gz
      |       |-- BK-AA-3_S13_L007_R1_001.fastq.gz
      |       |-- BK-AA-4_S14_L007_R1_001.fastq.gz
      |       |-- BK-AA-5_S15_L007_R1_001.fastq.gz
      |       `-- BK-AA-6_S16_L007_R1_001.fastq.gz
      `-- 180223_K00242_0357_AHT37VBBXX-RKeenan-AA-RS10
          `-- FastQ
              |-- BK-AA-1_S11_L007_R1_001.fastq.gz
              |-- BK-AA-2_S12_L007_R1_001.fastq.gz
              |-- BK-AA-3_S13_L007_R1_001.fastq.gz
              |-- BK-AA-4_S14_L007_R1_001.fastq.gz
              |-- BK-AA-5_S15_L007_R1_001.fastq.gz
              `-- BK-AA-6_S16_L007_R1_001.fastq.gz
    • Genome data are located at /gpfs/data/bioinformatics/ReferenceData/cri_rnaseq_2018/vM18_93_GRCm38.p6

      $ tree /gpfs/data/bioinformatics/cri_rnaseq_2018_ex/example/references/v28_92_GRCh38.p12
      |-- GRCh38_rRNA.bed
      |-- GRCh38_rRNA.bed.interval_list
      |-- STAR
      |   |-- Genome
      |   |-- Log.out
      |   |-- SA
      |   |-- SAindex
      |   |-- chrLength.txt
      |   |-- chrName.txt
      |   |-- chrNameLength.txt
      |   |-- chrStart.txt
      |   |-- exonGeTrInfo.tab
      |   |-- exonInfo.tab
      |   |-- geneInfo.tab
      |   |-- genomeParameters.txt
      |   |-- run_genome_generate.logs
      |   |-- sjdbInfo.txt
      |   |-- sjdbList.fromGTF.out.tab
      |   |-- sjdbList.out.tab
      |   `-- transcriptInfo.tab
      |-- genes.gtf
      |-- genes.gtf.bed12
      |-- genes.refFlat.txt
      |-- genome.chrom.sizes
      |-- genome.dict
      `-- genome.fa
  • Pipeline/project related files
    • project related files (i.e., metadata & configuration file) as used in this tutorial are located under CRI-BIO-646/

      $ ls -l CRI-BIO-646.*
      |-- CRI-BIO-646.metadata.txt
      |-- CRI-BIO-646.pipeline.yaml
      • Here are the first few lines in the configuration example file CRI-BIO-646/CRI-BIO-646.pipeline.yaml

        ---
        pipeline:
          flags:
            aligners:
              run_star: True
            quantifiers:
              run_featurecounts: True
              run_rsem: False
              run_kallisto: False
            callers:
              run_edger: True
              run_deseq2: True
              run_limma: True
          software:
            main:
              use_module: 0
              adapter_pe: AGATCGGAAGAGCGGTTCAG,AGATCGGAAGAGCGTCGTGT
              adapter_se: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
              fastq_format: 33
              genome_assembly: hg38

      When running on another dataset, you will need to modify these two files and the master pipeline script (i.e., Build_RNAseq.CRI-BIO-646.sh) (as described below) accordingly.

      For instance, if you would like to turn off the DE analysis tool limma, you can set the respecitve paramter to ‘False’ in configuration file
      - run_limma: False

      For metadata file, you might pay attendtion on the following settings
      1. Single End (SE) Library
        1. Set Flavor column as 1xReadLength (e.g., 1x50)
        2. Set Seqfile1 column as the file name of the repective sequencing file
      2. Paired End (PE) Library
        1. Set Flavor column as 2xReadLength (e.g., 2x50)
        2. Set Seqfile1 column as the file name of the repective read 1 (R1) sequencing file
        3. Set an additional column named ‘Seqfile2’ as the file name of the repective read 2 (R2) sequencing file
      3. Non strand-specific Library
        1. Set LibType column to NS
      4. Strand-specific Library
        1. Inquire the library type from your seuqencing center and set LibType column to FR (the left-most end of the fragment (in transcript coordinates, or the first-strand synthesis) is the first sequenced) or RF (the right-most end of the fragment (in transcript coordinates) is the first sequenced, or the second-strand synthesis). You can read this blog for more details of strand-specific RNA-seq.
    • Master pipeline script

      $ cat Build_RNAseq.CRI-BIO-646.sh
      ## build pipeline scripts
      
      now=$(date +"%m-%d-%Y_%H:%M:%S")
      
      ## project info
      project="CRI-BIO-646"
      SubmitRNAseqExe="Submit_${PWD##*/}.sh"
      padding="CRI-BIO-646/"
      
      ## command
      echo "START" `date` " Running build_rnaseq.py"
      python3 SRC/Python/build_rnaseq.py \
          --projdir $PWD \
          --metadata $PWD/${padding}$project.metadata.txt \
          --config $PWD/${padding}$project.pipeline.yaml \
          --systype cluster \
          --threads 8 \
          --log_file $PWD/Build_RNAseq.$project.$now.log
      
      ## submit pipeline master script
      echo "START" `date` " Running $SubmitRNAseqExe"
      echo "bash $SubmitRNAseqExe"
      
      echo "END" `date`

      Basically, when running on your own dataset, you will need to modify this master pipeline script (i.e., Build_RNAseq.CRI-BIO-646.sh) accordingly.

      For instance, you can change respective parameters as follows.
      - project=“PROJECT_AS_PREFIX” (e.g., CRI-BIO-646 which is used as a prefix of metadata file CRI-BIO-646.metadata.txt and configuration file CRI-BIO-646.pipeline.yaml)
      - padding=“DIRECTORY_NAME_CONTAINING_PROJECT_DATA” (e.g., CRI-BIO-646 which is the folder name to accommodate metadata file, configuration file, sequencing data folder, and references folder)

Pipeline Steps | Top

Step 1: Quality Control | Top

For the first step, the pipeline will perform quality assessment on the raw fastq files.

The BDS code snippet for the sample KO01 will look like:

$ grep -A1 run.RawReadQC.FastQC.BK-AA-4_S14_L007_R355.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/RawReadQC/KO01/BK-AA-4_S14_L007_R355/BK-AA-4_S14_L007_R1_001_fastqc.zip' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ/BK-AA-4_S14_L007_R1_001.fastq.gz' ], cpus := 1, mem := 16*G, timeout := 72*hour, taskName := "FastQC.BK-AA-4_S14_L007_R355") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.RawReadQC.FastQC.BK-AA-4_S14_L007_R355.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/RawReadQC/KO01/BK-AA-4_S14_L007_R355/BK-AA-4_S14_L007_R1_001_fastqc.zip' ] )

This code chunk will invoke the bash script RNAseq/shell_scripts/run.RawReadQC.FastQC.BK-AA-4_S14_L007_R355.sh to execute FastQC on the KO01(BK-AA-1_S11_L007_R355) sequencing library.

After the completion of the entire pipeline, you can check FastQC report per individual libraries; for instance, the partial report of KO01 will be as follows or a full report.

KO01_FastQC

You can check FastQC Help for more details about how to interpret a FastQC report.

Or, compare your reports to the example reports provided by FastQC for a Good Illumina Data or Bad Illumina Data.

Step 2.1: Read Alignment | Top

In this step, the pipeline will conduct read alignment on the raw fastq files.

The BDS code snippet for the sample KO01 will look like:

$ grep -A1 run.alignRead.star.BK-AA-4_S14_L007_R355.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/BK-AA-4_S14_L007_R355/BK-AA-4_S14_L007_R355.star.bam' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646/data/180216_K00242_0355_AHT7NTBBXX-RKeenan-AA-RS10/FastQ/BK-AA-4_S14_L007_R1_001.fastq.gz' ], cpus := 4, mem := 64*G, timeout := 72*hour, taskName := "star.BK-AA-4_S14_L007_R355") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.alignRead.star.BK-AA-4_S14_L007_R355.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/BK-AA-4_S14_L007_R355/BK-AA-4_S14_L007_R355.star.bam' ] )

This code chunk will invoke the bash script RNAseq/shell_scripts/run.alignRead.star.BK-AA-4_S14_L007_R355.sh to execute STAR on the KO01(BK-AA-4_S14_L007_R355) sequencing library.

After the completion of the entire pipeline, you can check the alignment result of each individual libraries; for instance, the result of KO01(BK-AA-4_S14_L007_R355) will be as follows.

$ tree RNAseq/Aln/star/KO01/BK-AA-4_S14_L007_R355
RNAseq/Aln/star/KO01/BK-AA-4_S14_L007_R355
|-- BK-AA-4_S14_L007_R355.star.Aligned.sortedByCoord.out.bam
|-- BK-AA-4_S14_L007_R355.star.Log.final.out
|-- BK-AA-4_S14_L007_R355.star.Log.out
|-- BK-AA-4_S14_L007_R355.star.Log.progress.out
|-- BK-AA-4_S14_L007_R355.star.SJ.out.tab
|-- BK-AA-4_S14_L007_R355.star.Unmapped.out.mate1
|-- BK-AA-4_S14_L007_R355.star.bai
|-- BK-AA-4_S14_L007_R355.star.bam -> BK-AA-4_S14_L007_R355.star.Aligned.sortedByCoord.out.bam
`-- run.alignRead.star.BK-AA-4_S14_L007_R355.log

You can check a log file (e.g., RNAseq/Aln/star/KO01/BK-AA-4_S14_L007_R355/BK-AA-4_S14_L007_R355.star.Log.final.out) for more alignment information provided by STAR.

$ cat RNAseq/Aln/star/KO01/BK-AA-4_S14_L007_R355/BK-AA-4_S14_L007_R355.star.Log.final.out
##                                  Started job on |    Nov 07 12:19:19
##                              Started mapping on |    Nov 07 12:19:34
##                                     Finished on |    Nov 07 12:21:59
##        Mapping speed, Million of reads per hour |    430.40
## 
##                           Number of input reads |    17335560
##                       Average input read length |    50
##                                     UNIQUE READS:
##                    Uniquely mapped reads number |    14809963
##                         Uniquely mapped reads % |    85.43%
##                           Average mapped length |    49.84
##                        Number of splices: Total |    2272172
##             Number of splices: Annotated (sjdb) |    2259453
##                        Number of splices: GT/AG |    2249028
##                        Number of splices: GC/AG |    19534
##                        Number of splices: AT/AC |    2520
##                Number of splices: Non-canonical |    1090
##                       Mismatch rate per base, % |    0.20%
##                          Deletion rate per base |    0.00%
##                         Deletion average length |    1.44
##                         Insertion rate per base |    0.00%
##                        Insertion average length |    1.28
##                              MULTI-MAPPING READS:
##         Number of reads mapped to multiple loci |    0
##              % of reads mapped to multiple loci |    0.00%
##         Number of reads mapped to too many loci |    2403663
##              % of reads mapped to too many loci |    13.87%
##                                   UNMAPPED READS:
##        % of reads unmapped: too many mismatches |    0.00%
##                  % of reads unmapped: too short |    0.56%
##                      % of reads unmapped: other |    0.15%
##                                   CHIMERIC READS:
##                        Number of chimeric reads |    0
##                             % of chimeric reads |    0.00%

Step 2.2: Alignment QC | Top

In this step, the pipeline will conduct a QC on alignment result.

The BDS code snippets for the sample KO01 will look like:

$ grep -A1 run.alnQC.*.star.KO01.*.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics.pdf' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/KO01.star.bai' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "picard.star.KO01") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.alnQC.picard.star.KO01.CollectRnaSeqMetrics.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics.pdf' ] )
--
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.xls', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.r', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.pdf' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/KO01.star.bai' ], cpus := 8, mem := 64*G, timeout := 72*hour, taskName := "rseqc.star.KO01") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.alnQC.rseqc.star.KO01.clipping_profile.py.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.xls', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.r', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.clipping_profile.pdf' ] )
--
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.infer_experiment.txt' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/KO01.star.bai' ], cpus := 8, mem := 64*G, timeout := 72*hour, taskName := "rseqc.star.KO01") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.alnQC.rseqc.star.KO01.infer_experiment.py.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.infer_experiment.txt' ] )

This code chunk will invoke few bash scripts (e.g., RNAseq/shell_scripts/run.alnQC.picard.star.KO01.CollectRnaSeqMetrics.sh, run.alnQC.rseqc.star.KO01.clipping_profile.py.sh, and run.alnQC.rseqc.star.KO01.infer_experiment.py.sh) to execute alignment QC tools (i.e., Picard and RSeQC) on the sample KO01.

After the completion of the entire pipeline, you can check the alignment QC results of each individual samples; for instance, the results of KO01 will be as follows.

$ tree RNAseq/AlnQC/*/star/KO01
RNAseq/AlnQC/picard/star/KO01
|-- KO01.star.picard.RNA_Metrics
|-- KO01.star.picard.RNA_Metrics.pdf
|-- run.alnQC.picard.star.KO01.CollectRnaSeqMetrics.log
`-- tmp
RNAseq/AlnQC/rseqc/star/KO01
|-- KO01.star.rseqc.clipping_profile.pdf
|-- KO01.star.rseqc.clipping_profile.r
|-- KO01.star.rseqc.clipping_profile.xls
|-- KO01.star.rseqc.infer_experiment.txt
|-- run.alnQC.rseqc.star.KO01.clipping_profile.py.log
`-- tmp

You can check alignment statistics (e.g., RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics) for more information provided by Picard.

$ head RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics
## ## htsjdk.samtools.metrics.StringHeader
## # picard.analysis.CollectRnaSeqMetrics REF_FLAT=/gpfs/data/bioinformatics/cri_rnaseq_2018/example/references/v28_92_GRCh38.p12/genes.refFlat.txt RIBOSOMAL_INTERVALS=/gpfs/data/bioinformatics/cri_rnaseq_2018/example/references/v28_92_GRCh38.p12/GRCh38_rRNA.bed.interval_list STRAND_SPECIFICITY=SECOND_READ_TRANSCRIPTION_STRAND CHART_OUTPUT=/gpfs/data/bioinformatics/wchan10/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics.pdf METRIC_ACCUMULATION_LEVEL=[SAMPLE, ALL_READS] INPUT=/gpfs/data/bioinformatics/wchan10/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/KO01.star.bam OUTPUT=/gpfs/data/bioinformatics/wchan10/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/KO01.star.picard.RNA_Metrics TMP_DIR=[/gpfs/data/bioinformatics/wchan10/CRI-BIO-646-BMB-RKeenan/RNAseq/AlnQC/picard/star/KO01/tmp]    MINIMUM_LENGTH=500 RRNA_FRAGMENT_PERCENTAGE=0.8 ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
## ## htsjdk.samtools.metrics.StringHeader
## # Started on: Wed Nov 07 12:32:15 CST 2018
## 
## ## METRICS CLASS picard.analysis.RnaSeqMetrics
## PF_BASES PF_ALIGNED_BASES    RIBOSOMAL_BASES CODING_BASES    UTR_BASES   INTRONIC_BASES  INTERGENIC_BASES    IGNORED_READS   CORRECT_STRAND_READS    INCORRECT_STRAND_READS  PCT_RIBOSOMAL_BASES PCT_CODING_BASES    PCT_UTR_BASES   PCT_INTRONIC_BASES  PCT_INTERGENIC_BASES    PCT_MRNA_BASES  PCT_USABLE_BASES    PCT_CORRECT_STRAND_READS    MEDIAN_CV_COVERAGE  MEDIAN_5PRIME_BIAS  MEDIAN_3PRIME_BIAS  MEDIAN_5PRIME_TO_3PRIME_BIAS    SAMPLE  LIBRARY READ_GROUP
## 1582833150   1577842710  334200  876931136   608664079   71989244    19924828    0   4418837 102189  0.000212    0.555779    0.385757    0.045625    0.012628    0.941536    0.938567    0.977397    0.904811    0.026924    0.99656 0.041207            
## 1582833150   1577842710  334200  876931136   608664079   71989244    19924828    0   4418837 102189  0.000212    0.555779    0.385757    0.045625    0.012628    0.941536    0.938567    0.977397    0.904811    0.026924    0.99656 0.041207    unknown     

Or, the respective coverage plot of the sample KO01 produced by Picard will be as follows.

KO01_coverage

There are two alignment measurements performed using RSeQC.

  1. clipping_profile.py
    • Calculate the distributions of clipped nucleotides across reads
  2. infer_experiment.py
    • Use to “guess” how RNA-seq sequencing was configured, particularly how reads were stranded for strand-specific RNA-seq data, through comparing the “strandedness of reads” with the “strandedness of transcripts”.

The results will be as follows. Please check the RSeQC website for more measurements and details.

KO01_clipping_profile

$cat RNAseq/AlnQC/rseqc/star/KO01/KO01.star.rseqc.infer_experiment.txt
## 
## 
## This is SingleEnd Data
## Fraction of reads failed to determine: 0.0757
## Fraction of reads explained by "++,--": 0.0179
## Fraction of reads explained by "+-,-+": 0.9064

Here, you can confirm again the sample KO01 is a single-end, strand-specific library using the second-strand synthesis.

Step 3: Expression Quantification | Top

In this step, the pipeline will conduct expression quantification over alignments.

The BDS code snippet for the sample KO01 will look like:

$ grep -A1 run.quant.featurecounts.star.KO01.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/KO01.star.bai' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "featurecounts.star.KO01") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.quant.featurecounts.star.KO01.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count' ] )

This code chunk will invoke the bash script (e.g., RNAseq/shell_scripts/run.quant.featurecounts.star.KO01.sh) to execute expression quantification tool (i.e., Subread::featureCounts on the sample KO01.

After the completion of the entire pipeline, you can check the quantification results of each individual samples; for instance, the results of KO01 will be as follows.

$ tree RNAseq/Quantification/featurecounts/star/KO01
RNAseq/Quantification/featurecounts/star/KO01
|-- KO01.star.featurecounts.count
|-- KO01.star.featurecounts.count.jcounts
|-- KO01.star.featurecounts.count.summary
`-- run.quant.featurecounts.star.KO01.log

You can check quantification statistics (e.g., RNAseq/Quantification/featurecounts/star/KO01KO01.star.featurecounts.count.summary) for more information provided by Subread::featureCounts

$ cat RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count.summary
## Status   /gpfs/data/bioinformatics/wchan10/CRI-BIO-646-BMB-RKeenan/RNAseq/Aln/star/KO01/KO01.star.bam
## Assigned 28638683
## Unassigned_Unmapped  0
## Unassigned_MappingQuality    0
## Unassigned_Chimera   0
## Unassigned_FragmentLength    0
## Unassigned_Duplicate 0
## Unassigned_MultiMapping  0
## Unassigned_Secondary 0
## Unassigned_Nonjunction   0
## Unassigned_NoFeatures    2036590
## Unassigned_Overlapping_Length    0
## Unassigned_Ambiguity 981390

Or, the top 10 most abundant genes in the sample KO01 will be as follows.

$ cat <(head -n2 RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count | tail -n+2 | cut -f1,7) <(cut -f1,7 RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count | sort -k2,2nr | head)
Top 10 most abundant genes in KO01
Geneid Chr Start End Strand Length KO01
ENSG00000210082.2 chrM 1671 3229
1559 129707
ENSG00000198804.2 chrM 5904 7445
1542 129552
ENSG00000167658.15 chr19 3976056 3985469
4027 105680
ENSG00000229807.10 chrX 73820651 73851091
19961 99939
ENSG00000074800.14 chr1 8861000 8879190
5276 88419
ENSG00000198886.2 chrM 10760 12137
1378 86567
ENSG00000111640.14 chr12 6533927 6538371
2981 79933
ENSG00000184009.10 chr17 81509971 81523847
2993 75963
ENSG00000198712.1 chrM 7586 8269
684 70428
ENSG00000067225.17 chr15 72199029 72231822
8571 69910

Step 4-1: Identify Differentially Expressed Genes (DEGs) | Top

In this step, the pipeline will identify differentially expressed genes (DEG) according to the alignment result files (i.e., BAM files) after the alignment step.

The BDS code snippets for the example dataset will look like:

$ grep -A1 run.call.*.featurecounts.star.*.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/edger/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.count.txt', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/edger/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.test.DEG.txt' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT01/WT01.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT02/WT02.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT03/WT03.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO02/KO02.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO03/KO03.star.featurecounts.count' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "edger.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.call.edger.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/edger/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.count.txt' ] )
--
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.txt', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.test.DEG.txt' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT01/WT01.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT02/WT02.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT03/WT03.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO02/KO02.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO03/KO03.star.featurecounts.count' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "deseq2.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.call.deseq2.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.txt' ] )
--
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/limma/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.count.txt', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/limma/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.test.DEG.txt' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT01/WT01.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT02/WT02.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/WT03/WT03.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO01/KO01.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO02/KO02.star.featurecounts.count', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/Quantification/featurecounts/star/KO03/KO03.star.featurecounts.count' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "limma.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.call.limma.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/limma/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.count.txt' ] )

This code chunk will invoke few bash scripts (e.g., RNAseq/shell_scripts/run.call.edger.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh, run.call.deseq2.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh, and run.call.limma.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh) to execute differential expression (DE) analysis using three the state-of-the-art tools (i.e., edgeR, DESeq2, and limma) on the example dataset of six samples from KO01 to WT03.

There are three DE analysis tools used in the current pipeline, including

  1. edgeR: Empirical Analysis of Digital Gene Expression Data in R
  2. DESeq2: Differential gene expression analysis based on the negative binomial distribution
  3. limma: Linear Models for Microarray Data

After the completion of the entire pipeline, you can check the calling results of each individual methods; for instance, the analysis results of the example dataset will be as follows.

$ tree RNAseq/DEG/*/featurecounts/star/
RNAseq/DEG/deseq2/featurecounts/star/
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.RData
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.ntd.meanSdPlot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.ntd.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.rld.meanSdPlot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.rld.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.vst.meanSdPlot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.vst.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.plotDispEsts.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.plotMA.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.test.DEG.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.test.txt
`-- run.call.deseq2.featurecounts.star.CRI-BIO-646-BMB-RKeenan.log
RNAseq/DEG/edger/featurecounts/star/
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.RData
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.count.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.plotBCV.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.plotMA.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.plotSmear.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.test.DEG.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.test.txt
`-- run.call.edger.featurecounts.star.CRI-BIO-646-BMB-RKeenan.log
RNAseq/DEG/limma/featurecounts/star/
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.RData
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.count.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.count.voom.meanSdPlot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.plotMA.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.test.DEG.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.test.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.voom.mean-variance.pdf
`-- run.call.limma.featurecounts.star.CRI-BIO-646-BMB-RKeenan.log

You can check statistical test results per gene (e.g., RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.test.txt) for more information generated by each method.

$ head RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.test.txt
Statistical test result per gene using DESeq2
Geneid baseMean log2FoldChange lfcSE stat pvalue FDR DEG
8943 ENSG00000139219.18 7593.6731 3.364988 0.0909209 37.01005 0 0 1
9845 ENSG00000092067.5 3022.5103 3.543611 0.0958003 36.98957 0 0 1
14075 ENSG00000101160.13 1270.4139 3.254167 0.1034734 31.44932 0 0 1
13557 ENSG00000142530.10 1241.2705 -3.737343 0.1221921 -30.58581 0 0 -1
9650 ENSG00000136167.13 2428.1942 2.931033 0.0974567 30.07524 0 0 1
8362 ENSG00000084207.16 1082.8763 3.350925 0.1188343 28.19830 0 0 1
9368 ENSG00000111331.12 575.0466 3.131198 0.1228025 25.49785 0 0 1
5104 ENSG00000135346.8 374.1435 -5.235472 0.2229213 -23.48574 0 0 -1
14910 ENSG00000165194.15 1085.3843 -2.145027 0.0973406 -22.03630 0 0 -1
2440 ENSG00000163053.10 2297.1436 -1.763135 0.0818927 -21.52983 0 0 -1

Or, the exploratory plots of the example dataset produced by DESeq2 will be as follows.

  1. An exploratory plot of the per-gene dispersion estimates together with the fitted mean-dispersion relationship
    • KO01_DispEsts
  2. An exploratory plot of row standard deviations versus row means using the normalized counts transformation (f(count + pc))
    • KO01_ntd_meanSd
  3. An exploratory plot of row standard deviations versus row means using the variance stabilizing transformation
    • KO01_rld_meanSd
  4. An exploratory plot of row standard deviations versus row means using the ‘regularized log’ transformation
    • KO01_vst_meanSd
  5. A MA plot of log2 fold changes (on the y-axis) versus the mean of normalized counts (on the x-axis)
    • KO01_plotMA
  6. A scatter plot of log2 fold changes (on the y-axis) versus the FDR (on the x-axis)
    • KO01_plotMA

Step 4-2: DEG Statistics | Top

In this step, the pipeline will collect DEG statistics and identify the overlapping set of identified DEGs from the previous methods.

The BDS code snippet for the sample KO01 will look like:

$ grep -A1 run.lociStat.featurecounts.star.*.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/LociStat/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.overlap.txt' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/edger/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.edger.test.DEG.txt', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.test.DEG.txt', '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/limma/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.limma.test.DEG.txt' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "lociStat.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.lociStat.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/LociStat/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.overlap.txt' ] )

This code chunk will invoke the bash script (e.g., RNAseq/shell_scripts/run.lociStat.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh) to collect DEG statistics and to make a Venn diagram plot.

After the completion of the entire pipeline, you can check the statistics result of DEGs per method; for instance, the example dataset CRI-BIO-646 will be as follows.

$ grep -A5 'Up/Down regulated DEGs per methods' RNAseq/LociStat/featurecounts/star/*/run.lociStat.featurecounts.star.*.log | tail -n+2
## grep: result/run.lociStat.featurecounts.star.*.log: No such file or directory
$ cut -f1,2,4 RNAseq/LociStat/featurecounts/star/*/*.star.featurecounts.VennList.txt
DEG Statistics
Methods Method.Num ID.Num
edger 1 16
deseq2 1 3
limma 1 68
edger&deseq2 2 45
edger&limma 2 107
deseq2&limma 2 3
edger&deseq2&limma 3 2669

There is a Venn diagram plot will be generated after this step.

Venn

Step 5: Sample Correlation | Top

In this step, the pipeline will make a PCA plot based on the transcriptional profiling of all samples.

The BDS code snippet for the sample KO01 will look like:

$ grep -A1 run.quantQC.pca.featurecounts.star.*.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/QuantQC/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.pca.pdf' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/DEG/deseq2/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.deseq2.count.txt' ], cpus := 8, mem := 64*G, timeout := 72*hour, taskName := "pca.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.quantQC.pca.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/QuantQC/featurecounts/star/CRI-BIO-646-BMB-RKeenan.star.featurecounts.pca.pdf' ] )

This code chunk will invoke the bash script (e.g., RNAseq/shell_scripts/run.quantQC.pca.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh) to make a PCA plot based on the alignment quantification result generated by DESeq2 or one of DE analysis tools.

After the completion of the entire pipeline, you can check the PCA plot under the folder of QuantQC/.

PCA

Step 6: Heat Map | Top

In this step, the pipeline will make a heat map based on the overlapping set of DEGs identified across different DE analysis tools.

The BDS code snippet for the sample KO01 will look like:

$ grep -A1 run.postAna.pheatmap.featurecounts.star.*.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/PostAna/pheatmap/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.heatmap.pdf' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/LociStat/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.overlap.txt' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "postAna.pheatmap.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.postAna.pheatmap.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/PostAna/pheatmap/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.heatmap.pdf' ] )

This code chunk will invoke the bash script (e.g., RNAseq/shell_scripts/run.postAna.pheatmap.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh) to make a heat map plot based on the overlapping set of DEGs identified across different DE analysis tools (i.e., edgeR, DESeq2, and limma).

After the completion of the entire pipeline, you can check the heat map under the folder of PostAna/pheatmap.

CRI-BIO-646_full_Venn

Step 7: Functional Enrichment Analysis | Top

In this step, the pipeline will conduct enrichment analysis and make several exploratory plots based on the overlapping set of DEGs identified across different DE analysis tools.

The BDS code snippet for the projet CRI-BIO-646-BMB-RKeenan will look like:

$ grep -A1 run.postAna.clusterprofiler.featurecounts.star.*.sh Submit_*.bds
dep( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/PostAna/clusterprofiler/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGO.ALL.txt' ] <- [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/LociStat/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.overlap.txt' ], cpus := 4, mem := 32*G, timeout := 72*hour, taskName := "postAna.clusterprofiler.featurecounts.star.CRI-BIO-646-BMB-RKeenan") sys bash /gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/shell_scripts/run.postAna.clusterprofiler.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh; sleep 2
goal( [ '/gpfs/data/bioinformatics/username/CRI-BIO-646-BMB-RKeenan/RNAseq/PostAna/clusterprofiler/featurecounts/star/CRI-BIO-646-BMB-RKeenan/CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGO.ALL.txt' ] )

This code chunk will invoke the bash script (e.g., RNAseq/shell_scripts/run.postAna.clusterprofiler.featurecounts.star.CRI-BIO-646-BMB-RKeenan.sh) to conduct enrichment analyses including GO and KEGG pathway erichment analyses as well as gene set enrichment analysis (GSEA).

After the completion of the entire pipeline, you can check the heat map under the folder of PostAna/pheatmap.

$ tree RNAseq/PostAna/clusterprofiler/featurecounts/star/*/
RNAseq/PostAna/clusterprofiler/featurecounts/star/CRI-BIO-646-BMB-RKeenan/
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGO.ALL.cnetplot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGO.ALL.dotplot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGO.ALL.emapplot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGO.ALL.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGSEAGO.ALL.pos001.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichGSEAGO.ALL.txt
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichKEGG.cnetplot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichKEGG.dotplot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichKEGG.emapplot.pdf
|-- CRI-BIO-646-BMB-RKeenan.star.featurecounts.enrichKEGG.txt
`-- run.postAna.clusterprofiler.featurecounts.star.CRI-BIO-646-BMB-RKeenan.log

Below, several plots will be generated based on the overlapping set of DEGs using GO database as example.

  1. An exploratory dot plot for enrichment result
    • CRI-BIO-646_full_enrichGO.ALL.dotplot
  2. An exploratory enrichment map for enrichment result of over-representation test
    • CRI-BIO-646_full_enrichGO.ALL.emapplot
  3. An exploratory Gene-Concept Network plot of over-representation test
    • CRI-BIO-646_full_enrichGO.ALL.cnetplot
  4. An exploratory plot of GSEA result with NES great than zero
    • CRI-BIO-646_full_enrichGSEAGO.ALL.pos001

BigDataScript Report | Top

Considering the environment setting in the CRI HPC system, BigDataScript was used as a job management system in the current development to achieve an automatic pipeline. It can handle the execution dependency of all sub-task bash scripts and resume from a failed point, if any.

After the completion of the entire pipeline, you will see a BigDataScript report in HTML under the pipeline folder. For instance, this is the report from one test run. The graphic timeline will tell you the execution time per sub-task script.

Submit.RNAseq.CRI-BIO-646.bds

[Last Updated on 2018/11/07] | Top